Open-vocabulary spoken-document retrieval based on query expansion using related web documents
Abstract
This paper proposes a new method for open-vocabulary spoken-document retrieval based on query expansion using related Web documents. A large vocabulary continuous speech recognition (LVCSR) system first transcribes spoken documents into word sequences, which are then segmented into semantically cohesive units (i.e., stories) using a text segmentation technique. Given a text query word, Web documents containing the query word are first retrieved. Each retrieved Web document can be regarded as an expanded form of the original query word. Spoken documents relevant to the query word are then retrieved by searching for stories whose LVCSR transcripts are similar to the previously retrieved Web documents. Experimental results show that the proposed method is quite effective in retrieving spoken documents such as broadcast news programs with out-of-vocabulary (OOV) queries. In addition, the proposed method is also useful for ranking retrieval results with in-vocabulary (IV) queries.
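The retrieval step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which similarity measure the paper uses, so TF-IDF weighting with cosine similarity is assumed here purely for illustration. The function names (`tokenize`, `tfidf_vectors`, `cosine`, `rank_stories`) and the toy data are hypothetical, and the stories and Web documents are taken as already available as plain text (LVCSR, story segmentation, and Web search are outside the sketch).

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: whitespace tokenization is enough for this toy example.
    return text.lower().split()


def tfidf_vectors(token_lists):
    """Return one sparse TF-IDF vector (a dict) per tokenized document."""
    n = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    # Smoothed IDF; an illustrative choice, not taken from the paper.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(tokens).items()}
            for tokens in token_lists]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_stories(stories, web_docs):
    """Rank stories by their best similarity to any retrieved Web document.

    Returns (story_index, score) pairs, highest score first.
    """
    token_lists = [tokenize(s) for s in stories] + [tokenize(d) for d in web_docs]
    vectors = tfidf_vectors(token_lists)
    story_vecs = vectors[:len(stories)]
    web_vecs = vectors[len(stories):]
    scores = [max((cosine(s, w) for w in web_vecs), default=0.0)
              for s in story_vecs]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


# Toy data: the query word "fukushima" is treated as OOV, so the recognizer
# output for story 0 contains a misrecognition instead of the word itself.
stories = [
    "the earthquake damaged the nuclear plant in fook a shima",
    "stock markets rose sharply today on tech earnings",
]
web_docs = ["fukushima nuclear plant damaged by earthquake and tsunami"]

for index, score in rank_stories(stories, web_docs):
    print(index, round(score, 3))
```

In this toy case the first story ranks highest even though its transcript never contains the query word, because it shares context words with the Web document. Taking the maximum over Web documents is one possible aggregation; averaging the scores would be an equally plausible choice under the same assumptions.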
Similar resources
Effects of Query Expansion for Spoken Document Passage Retrieval
One of the major challenges for spoken document retrieval is how to handle speech recognition errors within the target documents. Query expansion is promising for this challenge. In this paper, we apply relevance models, a type of query expansion method, for the spoken document passage retrieval task. We adapted the original relevance model for passage retrieval. We also extended it to benefit ...
Language Model Expansion Using Webdata for Spoken Document Retrieval
In recent years, there has been increasing demand for ad hoc retrieval of spoken documents. We can use existing text retrieval methods by transcribing spoken documents into text data using a Large Vocabulary Continuous Speech Recognizer (LVCSR). However, retrieval performance is severely degraded by recognition errors and out-of-vocabulary (OOV) words. To solve these problems, we previously...
RRLUFF: Ranking function based on Reinforcement Learning using User Feedback and Web Document Features
The principal aim of a search engine is to provide results sorted according to the user's requirements. To achieve this aim, it employs ranking methods to rank web documents based on their significance and relevance to the user query. The novelty of this paper is a user feedback-based ranking algorithm using reinforcement learning. The proposed algorithm is called RRLUFF, in which the rank...
Towards a context sensitive approach to searching information based on domain specific knowledge sources
In the context of document retrieval in the biomedical domain, this paper introduces a novel approach to searching for biomedical information using contextual semantic information. More specifically, we propose to combine the contextual semantic information in documents and user queries in an attempt to improve the performance of biomedical information retrieval (IR) systems. Contextual informa...
Spoken document retrieval method combining query expansion with continuous syllable recognition for NTCIR-SpokenDoc
In this paper, we propose a spoken document retrieval method which combines query expansion with continuous syllable recognition. The proposed method expands a query by using words from the web pages collected by a search engine. It is assumed that relevant document vectors exist on the plane which is constructed from the query vector and the extended vector. The weight parameter between a targ...